Viewing sentence boundary detection as collocation identification
Authors
Abstract
The detection of abbreviations is an important step in sentence boundary detection. We describe a flexible, language-independent and accurate method based on the idea that an abbreviation can be viewed as a collocation. As such, it can be identified with methods for collocation detection such as the log likelihood ratio. Although the log likelihood ratio is known to show good recall, its precision is poor. We employ scaling factors that lead to a strong improvement in precision. Experiments with English and German corpora show that abbreviations can be detected with high accuracy. We also show that inaccurate tokenization leads to a considerably higher error rate during tagging.
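As a rough illustration of the collocation view described in the abstract, the sketch below computes a standard (Dunning-style) log likelihood ratio for a word type and a following period. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the count-based interface are hypothetical, and the scaling factors mentioned in the abstract are not specified there, so they are not shown.

```python
import math


def _log_binomial(k, n, p):
    """Return log(p**k * (1 - p)**(n - k)), treating 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def log_likelihood_ratio(c_word, c_period, c_both, n_tokens):
    """Dunning-style -2 log(lambda) for a word followed by a period.

    c_word   : occurrences of the word type (hypothetical input)
    c_period : occurrences of the period token
    c_both   : occurrences of the word immediately followed by a period
    n_tokens : total number of tokens in the corpus
    """
    if not 0 < c_word < n_tokens:
        raise ValueError("need 0 < c_word < n_tokens")
    if not 0 <= c_both <= min(c_word, c_period):
        raise ValueError("c_both is inconsistent with the other counts")

    # Null hypothesis: a period is equally likely after any token.
    p = c_period / n_tokens
    # Alternative: the period rate differs after this word vs. elsewhere.
    p1 = c_both / c_word
    p2 = (c_period - c_both) / (n_tokens - c_word)

    rest_k = c_period - c_both
    rest_n = n_tokens - c_word
    log_lambda = (
        _log_binomial(c_both, c_word, p)
        + _log_binomial(rest_k, rest_n, p)
        - _log_binomial(c_both, c_word, p1)
        - _log_binomial(rest_k, rest_n, p2)
    )
    return -2.0 * log_lambda


# A word that is always followed by a period scores higher than one
# that is followed by a period only half of the time.
always = log_likelihood_ratio(10, 100, 10, 1000)
half = log_likelihood_ratio(10, 100, 5, 1000)
print(always > half > 0)
```

Under this view, a candidate with a high score would be treated as a likely abbreviation. The plain statistic measures only how far the counts depart from independence, which is one reason an unscaled ranking may contain many false candidates.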
Similar resources
Unsupervised Multilingual Sentence Boundary Detection
In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using thre...
An Algorithm Combining Statistics-based and Rules-based for Chunk Identification of Chinese Sentences
Natural language processing (NLP) is a very hot research domain. One important branch of it is sentence analysis, including Chinese sentence analysis. However, currently, no mature deep analysis theories and techniques are available. An alternative way is to perform shallow parsing on sentences which is very popular in the domain. The chunk identification is a fundamental task for shallow parsi...
Parsing and MWE Detection: Fips at the PARSEME Shared Task
Identifying multiword expressions (MWEs) in a sentence in order to ensure their proper processing in subsequent applications, like machine translation, and performing the syntactic analysis of the sentence are interrelated processes. In our approach, priority is given to parsing alternatives involving collocations, and hence collocational information helps the parser through the maze of alterna...
Sentence Analysis and Collocation Identification
Identifying collocations in a sentence, in order to ensure their proper processing in subsequent applications, and performing the syntactic analysis of the sentence are interrelated processes. Syntactic information is crucial for detecting collocations, and vice versa, collocational information is useful for parsing. This article describes an original approach in which collocations are identifi...
Shape Identification Technique for a 2-d Elliptic System by Boundary Integral Equation Method
This paper is concerned with the identification of the geometrical structure of the boundary shape for a two-dimensional boundary value problem. The output least square identification method is considered for estimating partially unknown boundary shapes. A numerical parameter estimation technique using the spline collocation method is proposed. This research was supported by the National Aeron...
Publication date: 2002